Evaluating Elevation, Experience & Body Type impact on Football Players
Fall 2024 Data Science Project
Luke Walker, Evan Losin, Owen Davitz

Contributions: For each member, list which of the following sections they worked on, and summarize the contributions in 1-2 sentences. Be specific!
A: Project idea: Evan had the idea to look into sports statistics, and Luke found the actual dataset.
B: Dataset Curation and Preprocessing: Since there were three of us in the group and three sections were needed for each step, we split the work accordingly. Luke did the piece about how experience affects quarterbacks, Evan did how height links to yards per reception, and Owen did how elevation affects field goal accuracy.
C: Data Exploration and Summary Statistics: We kept the same split: Luke on quarterback experience, Evan on height and yards per reception, and Owen on elevation and field goal accuracy.
D: ML Algorithm Design/Development: Since we each had the most experience with the data we had already worked on, we split the machine learning models accordingly.
E: ML Algorithm Training and Test Data Analysis: We continued working on the parts we had from the start.
F: Visualization, Result Analysis, Conclusion: We each did these for the machine learning part we took on.
G: Final Tutorial Report Creation: We all met to compile the components we worked on into the required format.

Intro¶

For our project we chose to investigate how different football-related stats affect the performance of players. As we all have an interest in the NFL, we wanted to see if we could uncover lesser-known relationships between different stats. We each decided to investigate a different area: Luke looked into how a quarterback's experience affects their play, Evan looked into the relationship between height and yards per catch, and Owen looked into how a kicker's birth elevation affects their performance at different elevations. Finding out how these stats are related could help teams decide which players to keep, and it also provides additional insight for fans.

Data Curation¶

Link to data set: https://www.kaggle.com/datasets/kendallgillies/nflstatistics?select=Basic_Stats.csv. In addition, we had to acquire elevation data, which we did partially from the Open-Elevation API and partially by looking up the elevations on Google.

In [52]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from geopy.geocoders import Nominatim
import requests
import seaborn as sns
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score


#Import passing, receiving, and basic player stats
Career_Passing = pd.read_csv('Career_Stats_Passing.csv')
basic_stats = pd.read_csv('Basic_Stats.csv')
career_rec = pd.read_csv('Career_Stats_Receiving.csv')
rec_game_logs = pd.read_csv('Game_Logs_Wide_Receiver_and_Tight_End.csv')

#Convert the stat columns to numeric types so they can be properly processed;
#passing yards are approximated here as 10 yards per completed pass
Career_Passing['Passing Yards'] = pd.to_numeric(Career_Passing['Passes Completed'], errors='coerce') * 10
Career_Passing['Games Played'] = pd.to_numeric(Career_Passing['Games Played'], errors='coerce')
Career_Passing['Yards per Game'] = Career_Passing['Passing Yards'] / Career_Passing['Games Played']

#load csv files
basic_stats_df = pd.read_csv('Basic_Stats.csv')
kicker_career_df = pd.read_csv('Career_Stats_Field_Goal_Kickers.csv')
kicker_game_df = pd.read_csv('Game_Logs_Kickers.csv')

#set player ids to strings
basic_stats_df['Player Id'] = basic_stats_df['Player Id'].astype(str)
kicker_game_df['Player Id'] = kicker_game_df['Player Id'].astype(str)
kicker_career_df['Player Id'] = kicker_career_df['Player Id'].astype(str)

#rename career field goal percentage
kicker_career_df['Career FG Percentage'] = kicker_career_df['FG Percentage']
kicker_career_df['Career FG Percentage'] = pd.to_numeric(kicker_career_df['Career FG Percentage'], errors='coerce')
kicker_career_df.dropna(inplace=True)

#remove kickers who played 16 or fewer career games (less than a full season's worth)
kicker_career_df['Total Games'] = kicker_career_df.groupby('Player Id')['Games Played'].transform('sum')
kicker_career_df = kicker_career_df[kicker_career_df['Total Games'] > 16]

#get birthplace of all kickers
joined_df = pd.merge(basic_stats_df, kicker_career_df, left_on=basic_stats_df['Player Id'], right_on=kicker_career_df['Player Id'])
joined_df = joined_df.drop(['Position_x', 'Position_y'], axis=1)
kickers_df = joined_df[['Player Id_x', 'Birth Place']]
kickers_df = kickers_df.drop_duplicates()

#code used to get elevations for player birthplaces


elevation_df = pd.DataFrame()
elevation_df['Location'] = kickers_df['Birth Place'].unique()
elevation_df['Elevation'] = None

def get_coordinates(city):
    geolocator = Nominatim(user_agent="city_elevation", timeout=20)
    location = geolocator.geocode(city)
    if location:
      return (location.latitude, location.longitude)
    else:
      location = geolocator.geocode(city.split()[-1])
      if location:
        return (location.latitude, location.longitude)
      else:
        print(f"Could not find coordinates for {city}")
        return None

def get_elevation(lat, lon):
  url = f'https://api.open-elevation.com/api/v1/lookup?locations={lat},{lon}'
  response = requests.get(url)
  if response.status_code == 200:
    elevation_data = response.json()
    return elevation_data['results'][0]['elevation']
  else:
    print("Error retrieving elevation data")
    return None

def get_elevations(cities):
  for city in cities:
    coords = get_coordinates(city)
    if coords:
      elevation = get_elevation(coords[0], coords[1])
      if elevation is not None:
        elevation_df.loc[elevation_df['Location'] == city, 'Elevation'] = elevation

#code to generate csv
#get_elevations(elevation_df['Location'])
#elevation_df.to_csv('birth_elevations.csv')

#there were a couple that didn't work so I filled them in manually

#import birth elevation
birth_elevation_df = pd.read_csv('birth_elevations.csv')
birth_elevation_df.sort_values('Elevation')

#code to get elevations for all team locations
#I just did these manually
team_elevation_df = pd.read_csv('team_elevations.csv')
team_elevation_df

#map team names to abbreviations
team_mapping = {
    "New Orleans Saints": "NO",
    "Oakland Raiders": "OAK",
    "New York Giants": "NYG",
    "Washington Redskins": "WAS",
    "Carolina Panthers": "CAR",
    "Buffalo Bills": "BUF",
    "New York Jets": "NYJ",
    "Pittsburgh Steelers": "PIT",
    "Baltimore Ravens": "BAL",
    "Detroit Lions": "DET",
    "Miami Dolphins": "MIA",
    "Dallas Cowboys": "DAL",
    "San Francisco 49ers": "SF",
    "Houston Texans": "HOU",
    "Cleveland Browns": "CLE",
    "St. Louis Rams": "STL",
    "San Diego Chargers": "SD",
    "Minnesota Vikings": "MIN",
    "Cincinnati Bengals": "CIN",
    "Arizona Cardinals": "ARI",
    "Green Bay Packers": "GB",
    "Tennessee Titans": "TEN",
    "Seattle Seahawks": "SEA",
    "Atlanta Falcons": "ATL",
    "Kansas City Chiefs": "KC",
    "Jacksonville Jaguars": "JAX",
    "Tampa Bay Buccaneers": "TB",
    "Denver Broncos": "DEN",
    "Indianapolis Colts": "IND",
    "Chicago Bears": "CHI",
    "New England Patriots": "NE",
    "Philadelphia Eagles": "PHI",
    "Los Angeles Raiders": "RAI",
    "Los Angeles Rams": "RAM",
    "Phoenix Cardinals": "PHO",
    "Boston Patriots": "BOS"
}

Exploratory Data Analysis (LUKE)¶

The first hypothesis is that a quarterback who has played more games will have a higher average of passing yards per game compared to one with fewer games played.

In [53]:
#The data we need is already converted into a format which we can process above
#Since there is so much data, it needs to be cleaned so that each data point makes sense in this context.

#This drops rows with missing or invalid values in the two columns below
valid_data = Career_Passing.dropna(subset=['Yards per Game', 'Games Played'])

#Find the median games played so we know where to split it
median_games = valid_data['Games Played'].median()

#Split the data into two sections, those who played more and less than the median
more_games_group = valid_data[valid_data['Games Played'] > median_games]
fewer_games_group = valid_data[valid_data['Games Played'] <= median_games]

#Calculate the avg yards per game for each quarterback in the group
more_games_avg_ypg = more_games_group['Yards per Game'].mean()
fewer_games_avg_ypg = fewer_games_group['Yards per Game'].mean()

#Print findings
print(f'Avg passing yards per game of a quarterback with more games: {more_games_avg_ypg}')
print(f'Avg passing yards per game of a quarterback with fewer games: {fewer_games_avg_ypg}')

#Make Box Plot to represent findings
data = [fewer_games_group['Yards per Game'], more_games_group['Yards per Game']]
labels = ['Fewer Games Played', 'More Games Played']
plt.boxplot(data, labels=labels)
plt.title('Comparison of Passing Yards per Game: Fewer vs More Games Played')
plt.ylabel('Passing Yards per Game')

plt.show()
Avg passing yards per game of a quarterback with more games: 71.88551732693666
Avg passing yards per game of a quarterback with fewer games: 54.688227919767066
[Figure: Comparison of Passing Yards per Game: Fewer vs More Games Played]

Three Conclusions Drawn From This Data:

  • Descriptive statistics
  • Correlation analysis
  • Hypothesis testing

Descriptive Statistics
The data description gives an overview of the dataset, including the total count of data points used and the mean, standard deviation, min, max, and quartile stats for games played, pass attempts, and passer rating. It helps us understand the overall distribution of these key statistics.

In [54]:
dataset_summary = Career_Passing.describe()
print(dataset_summary)

plt.hist(valid_data['Games Played'], bins=15)
plt.title('Distribution of Games Played')
plt.xlabel('Games Played')
plt.ylabel('Frequency')
plt.show()
              Year  Games Played  Pass Attempts Per Game  Passing Yards  \
count  8525.000000   8525.000000             8525.000000    4347.000000   
mean   1982.052551     10.294311                5.787824     687.941109   
std      23.822176      5.305723               10.533562    1021.944299   
min    1924.000000      0.000000                0.000000       0.000000   
25%    1965.000000      6.000000                0.000000      10.000000   
50%    1985.000000     12.000000                0.000000     110.000000   
75%    2003.000000     15.000000                5.900000    1080.000000   
max    2016.000000     17.000000               51.000000    4710.000000   

       Passer Rating  Yards per Game  
count    8525.000000     4337.000000  
mean       32.226111       62.971635  
std        40.485956       73.739956  
min         0.000000        0.000000  
25%         0.000000        0.714286  
50%         0.000000       25.000000  
75%        64.900000      120.714286  
max       158.300000      294.375000  
[Figure: Distribution of Games Played]

Correlation Analysis

There is a weak positive correlation (0.13) between the number of games played and passing yards per game. While not a strong relationship, quarterbacks who play more games tend to have slightly better averages.

In [55]:
correlation = valid_data['Games Played'].corr(valid_data['Yards per Game'])

print(f"Correlation between Games Played and Yards per Game: {correlation}")

plt.scatter(valid_data['Games Played'], valid_data['Yards per Game'])
plt.title('Games Played vs Yards per Game')
plt.xlabel('Games Played')
plt.ylabel('Yards per Game')
plt.show()
Correlation between Games Played and Yards per Game: 0.13059217810090026
[Figure: Games Played vs Yards per Game]

Hypothesis Testing

The two-sample t-test (t = 7.63, p-value = 2.92e-14) shows a statistically significant difference between quarterbacks who played more games and those who played fewer games. Players with more games on average tend to have significantly higher passing yards per game.

Null Hypothesis (H₀): There is no significant difference in the average passing yards per game between quarterbacks who played more games and those who played fewer games.

Alternative Hypothesis (H₁): There is a significant difference in the average passing yards per game between quarterbacks who played more games and those who played fewer games.

Based on the two-sample t-test (t = 7.63, p-value = 2.92e-14), we reject the null hypothesis and conclude that there is a significant difference in the average passing yards per game between the two groups.
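As a small illustrative sketch (not part of the original analysis), the decision rule applied here can be written as a helper that compares the p-value to the significance level α; the t-statistic and p-value below are the ones reported for this test.

```python
def decide(t_stat, p_value, alpha=0.05):
    """Return a verdict string for a two-sample t-test result."""
    if p_value < alpha:
        return f"reject H0 (t={t_stat:.2f}, p={p_value:.2e} < alpha={alpha})"
    return f"fail to reject H0 (t={t_stat:.2f}, p={p_value:.2e} >= alpha={alpha})"

# Using the statistics reported for this test:
print(decide(7.63, 2.92e-14))
```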

In [56]:
t_stat, p_value = stats.ttest_ind(
    more_games_group['Yards per Game'],
    fewer_games_group['Yards per Game'],
    equal_var=False
)

print(f'T-statistic: {t_stat}')
print(f'P-value: {p_value}')

means = [fewer_games_avg_ypg, more_games_avg_ypg]
labels = ['Fewer Games Played', 'More Games Played']
plt.bar(labels, means, yerr=[stats.sem(fewer_games_group['Yards per Game']), stats.sem(more_games_group['Yards per Game'])])
plt.title('Average Yards per Game by Group')
plt.ylabel('Average Yards per Game')
plt.show()
T-statistic: 7.6318812441510415
P-value: 2.919898696659963e-14
[Figure: Average Yards per Game by Group]

Primary Analysis/Visualizations (LUKE)¶

In [57]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.impute import SimpleImputer
from sklearn import tree
import matplotlib.pyplot as plt


features = ['Games Played', 'Passes Attempted', 'Passes Completed', 'Completion Percentage', 'TD Passes', 'Ints', 'Passer Rating']
target = 'Yards per Game'

valid_data = Career_Passing.dropna(subset=features + [target])

X = valid_data[features]
y = valid_data[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train.reset_index(drop=True, inplace=True)
X_test.reset_index(drop=True, inplace=True)
y_train.reset_index(drop=True, inplace=True)
y_test.reset_index(drop=True, inplace=True)

non_numeric_columns = X_train.select_dtypes(include=['object']).columns
print(f"Non-numeric columns: {non_numeric_columns.tolist()}")

for column in non_numeric_columns:
    X_train[column] = pd.to_numeric(X_train[column], errors='coerce')
    X_test[column] = pd.to_numeric(X_test[column], errors='coerce')

imputer = SimpleImputer(strategy='mean')
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)

dt_regressor = DecisionTreeRegressor(random_state=42)

param_grid = {
    'max_depth': [3, 5],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}

grid_search = GridSearchCV(
    estimator=dt_regressor,
    param_grid=param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,
    error_score='raise'
)


try:
    grid_search.fit(X_train, y_train)
except Exception as e:
    print(f"An error occurred during model fitting: {e}")
    raise

best_dt = grid_search.best_estimator_

y_pred = best_dt.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)

print(f'Best Parameters: {grid_search.best_params_}')
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r_squared}')

importances = best_dt.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': features, 'Importance': importances})
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 6))
plt.bar(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.title('Feature Importances from Decision Tree Regressor')
plt.xlabel('Features')
plt.ylabel('Importance')
plt.show()


plt.figure(figsize=(200, 100))
tree.plot_tree(best_dt, feature_names=features, filled=True)
plt.title('Decision Tree Structure')
plt.show()
Non-numeric columns: ['Passes Attempted', 'Passes Completed', 'Completion Percentage', 'TD Passes', 'Ints']
Best Parameters: {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2}
Mean Squared Error: 238.84329167011956
R-squared: 0.954232577202681
[Figure: Feature Importances from Decision Tree Regressor]
[Figure: Decision Tree Structure]

Conclusions & Insights(LUKE) :¶

The Decision Tree Regressor provided valuable insights into the factors affecting a quarterback's average passing yards per game. The key takeaways are that efficiency is key, that experience alone is not enough, and that the results support a more data-driven strategy. Passer rating and completion percentage are strong predictors, while simply playing more games does not significantly increase per-game averages. Using these insights, teams can design training and refine strategies to boost quarterback performance. By focusing on the most influential factors, coaches and players can make informed decisions to improve outcomes on the field.

The second hypothesis is that taller receivers average more yards per reception than shorter receivers.

Null Hypothesis: Taller receivers do not average more yards per reception than shorter receivers.

Alternative Hypothesis: Taller receivers do average more yards per reception than shorter receivers.¶

Exploratory Data Analysis (Evan)¶

In [58]:
#Merging and cleaning the data
df = pd.merge(basic_stats, career_rec, on='Player Id')
df = pd.merge(df, rec_game_logs, on='Player Id')
df = df[['Player Id', 'Height (inches)', 'Receptions_x', 'Receiving Yards_x']]
df['Receiving Yards_x'] = pd.to_numeric(df['Receiving Yards_x'], errors='coerce')
df['Receptions_x'] = pd.to_numeric(df['Receptions_x'], errors='coerce')
df = df[df['Receptions_x'] >= 40]
df = df[df['Receiving Yards_x'] >= 500]
df = df.dropna()

df = df.groupby('Player Id').mean().reset_index()
df['Yards Per Reception'] = df['Receiving Yards_x'] / df['Receptions_x']

Hypothesis testing: Using a two-sample t-test, we obtained a p-value of 0.185, meaning we fail to reject the null hypothesis.

Therefore, we cannot conclude that taller receivers average more yards per reception than shorter receivers.¶

In [59]:
#T Test
taller = df[df['Height (inches)'] >= 73]
shorter = df[df['Height (inches)'] < 73]
t, p = stats.ttest_ind(taller['Yards Per Reception'], shorter['Yards Per Reception'])
x_labs = ['Players shorter than 73 inches', 'Players 73 inches and taller']
y_labs = [shorter['Yards Per Reception'].mean(), taller['Yards Per Reception'].mean()]
plt.bar(x_labs, y_labs)
plt.title("Average Yards per Reception by Height Groups")
plt.ylabel("Yards per Reception")
print('t-statistic:', t)
print('p-value:', p)
t-statistic: -1.3272983734848558
p-value: 0.1851522343819193
[Figure: Average Yards per Reception by Height Groups]

Correlation Analysis: There is a very weak negative relationship between height and average yards per reception. This can be seen through the scatterplot below.¶

In [60]:
#Correlation
corr = df['Height (inches)'].corr(df['Yards Per Reception'])
#Scatter Plot
plt.scatter(df['Height (inches)'], df['Yards Per Reception'])
plt.xlabel('Height (inches)')
plt.ylabel('Yards Per Reception')
plt.title('Height vs. Yards Per Reception')

print(corr)
-0.13695887457982833
[Figure: Height vs. Yards Per Reception]

Summary Statistics: The summary statistics allow us to better understand our data and draw conclusions about our hypothesis.¶

Specifically, our two samples for the t-test were determined using the median height value from the summary statistics¶
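A minimal sketch of deriving a median-based split programmatically, assuming hypothetical sample heights (the real data's median height is 73 inches, which is why the groups below were split at 73):

```python
import pandas as pd

# hypothetical sample of receiver heights in inches (not the real dataset)
heights = pd.Series([67, 71, 73, 73, 74, 75, 80])

threshold = heights.median()               # for the real data this comes out to 73
taller = heights[heights >= threshold]
shorter = heights[heights < threshold]

print(threshold, len(taller), len(shorter))
```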

In [61]:
#Summary
summary = df.describe()
plt.hist(df['Height (inches)'], bins=14, edgecolor='black')
plt.title('Distribution of Heights')
plt.xlabel('Height (inches)')
plt.ylabel('Frequency')
print(summary)
       Height (inches)  Receptions_x  Receiving Yards_x  Yards Per Reception
count       410.000000    410.000000         410.000000           410.000000
mean         73.336585     53.763172         725.470488            13.669289
std           2.508446      8.664600         110.465848             2.165105
min          67.000000     40.000000         504.000000             8.775862
25%          71.000000     47.000000         643.000000            12.123745
50%          73.000000     53.000000         735.666667            13.513132
75%          75.000000     59.000000         806.750000            15.084034
max          80.000000     84.000000         998.000000            21.744186
[Figure: Distribution of Heights]

Machine Learning Primary Analysis (Evan)¶

ML Analysis: Regression¶

In [62]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression


X = df['Height (inches)'].values.reshape(-1, 1)
y = df['Yards Per Reception'].values

# Split the data
def split_data(X, Y, test_size=0.2, random_state=42):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=random_state)
    return X_train, X_test, Y_train, Y_test

X_train, X_test, y_train, y_test = split_data(X, y)

# Define functions
def fit_model(X_train, Y_train):
    model = LinearRegression()
    model.fit(X_train, Y_train)
    return model

def predict_data(model, X_train, X_test):
    Y_train_pred = model.predict(X_train)
    Y_test_pred = model.predict(X_test)
    return Y_train_pred, Y_test_pred

def evaluate_model(Y_train, Y_train_pred, Y_test, Y_test_pred):
    mse_train = mean_squared_error(Y_train, Y_train_pred)
    mse_test = mean_squared_error(Y_test, Y_test_pred)
    r2_train = r2_score(Y_train, Y_train_pred)
    r2_test = r2_score(Y_test, Y_test_pred)
    return mse_train, mse_test, r2_train, r2_test

# Create polynomial features
def create_polynomial_features(X, degree):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_poly = poly.fit_transform(X)
    return X_poly, poly

results = []

# Iterate through degrees 1 to 6
for degree in range(1, 7):

    X_poly_train, poly_transformer = create_polynomial_features(X_train, degree)
    X_poly_test = poly_transformer.transform(X_test)

    poly_model = fit_model(X_poly_train, y_train)

    Y_train_pred, Y_test_pred = predict_data(poly_model, X_poly_train, X_poly_test)

    mse_train, mse_test, r2_train, r2_test = evaluate_model(y_train, Y_train_pred, y_test, Y_test_pred)

    results.append((f'Polynomial Regression (degree {degree})', mse_train, mse_test, r2_train, r2_test))


results_df = pd.DataFrame(results, columns=['Model', 'MSE Train', 'MSE Test', 'R2 Train', 'R2 Test'])
print(results_df)
                              Model  MSE Train  MSE Test  R2 Train   R2 Test
0  Polynomial Regression (degree 1)   4.290609  5.783732  0.019195  0.013312
1  Polynomial Regression (degree 2)   4.136383  5.575286  0.054450  0.048872
2  Polynomial Regression (degree 3)   4.102971  5.693996  0.062088  0.028621
3  Polynomial Regression (degree 4)   4.056098  5.453387  0.072803  0.069668
4  Polynomial Regression (degree 5)   4.043953  5.559581  0.075579  0.051552
5  Polynomial Regression (degree 6)   4.044451  5.552693  0.075465  0.052727

Visualization¶

In [63]:
# Visualization of the curve for all degrees
plt.figure(figsize=(10, 6))
plt.scatter(X_train, y_train, color='blue', label='Train')
plt.scatter(X_test, y_test, color='orange', label='Test')

# Plot regression curves for each degree
for degree in range(1, 7):

    X_poly_train, poly_transformer = create_polynomial_features(X_train, degree)
    X_sorted = np.sort(X_train, axis=0)
    Y_poly_sorted_pred = fit_model(X_poly_train, y_train).predict(poly_transformer.transform(X_sorted))
    plt.plot(X_sorted, Y_poly_sorted_pred, label=f'Degree {degree}')

plt.xlabel('Height (inches)')
plt.ylabel('Yards Per Reception')
plt.title('Polynomial Regression: Height vs. Yards Per Reception')
plt.legend()
plt.grid(True)
plt.show()
[Figure: Polynomial Regression: Height vs. Yards Per Reception]

Conclusion and insights: By fitting polynomial regressions of various degrees, I was able to determine the best polynomial fit for the data. Based on the models evaluated for degrees 1 through 6, a degree-4 polynomial best fits our data, giving the lowest test MSE and the highest test R² due to the nonlinear shape of the data. However, the error and R² values are still not in a range that would lead me to conclude that height and yards per reception are correlated. This aligns with my correlation and p-value analysis, which indicated little to no relationship between the two variables.¶
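As a small sketch of the model-selection step described above (not part of the original notebook), the best degree can be read off the reported test-MSE values programmatically:

```python
import pandas as pd

# test-set MSE per polynomial degree, copied from the results table above
results = pd.DataFrame({
    'Degree': [1, 2, 3, 4, 5, 6],
    'MSE Test': [5.783732, 5.575286, 5.693996, 5.453387, 5.559581, 5.552693],
})

# pick the row with the lowest test MSE
best = results.loc[results['MSE Test'].idxmin()]
print(f"Best degree by test MSE: {int(best['Degree'])}")
```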

Exploratory Data Analysis (Owen)¶

My topic relates to the performance of NFL kickers at different altitudes. Specifically, I wanted to answer the question "Do kickers who were born at higher altitudes perform better at higher altitudes?" I feel this is an important question to answer because of the weight placed on kickers in NFL games. Often the winner of a game is decided by a last-second kick, meaning that a good kicking performance directly impacts the outcome of the game. In addition, whenever a game is played at high altitude there is always talk about what effects it could have on the visiting team. Therefore, we can assume that having a kicker a team knows will perform well at high altitudes would increase their chances of winning high-altitude games.

The third hypothesis is that kickers from higher altitudes are more accurate at higher altitudes than kickers from lower altitudes.

Read more about kicking at higher elevations: https://www.eyesonshow.com/denver-altitude-impact-field-goals-physics-football/#:~:text=Physics%20of%20Field%20Goals%20at%20High%20Altitude&text=At%20higher%20altitudes%2C%20the%20air,to%20longer%20distances%20for%20kicks.

In [64]:
#find performance of kickers at each team location

#get all needed info in one df
merged_df = pd.merge(basic_stats_df[['Birth Place', 'Player Id']], birth_elevation_df, left_on=basic_stats_df['Birth Place'], right_on=birth_elevation_df['Location'], how='inner')
merged_df = merged_df.merge(kicker_game_df[['Year', 'Player Id', 'Home or Away', 'Opponent', 'Longest FG Made', 'FGs Attempted', 'FGs Made', 'FG Percentage', 'Extra Points Attempted', 'Extra Points Made', 'Percentage of Extra Points Made']], on='Player Id', how='inner')
merged_df = merged_df.merge(kicker_career_df[['Player Id', 'Year', 'Team', 'Career FG Percentage']], on=['Player Id', 'Year'], how='inner')

#turn team names into abbreviations
merged_df['Team'] = merged_df['Team'].map(team_mapping).fillna(merged_df['Team'])

#get the location of the game
def get_location(row):
  if row['Home or Away'] == 'Home':
    return row['Team']
  else:
    return row['Opponent']

merged_df['Game Location'] = merged_df.apply(get_location, axis=1)

#remove stats from pro-bowl(APR, NPR, RIC, CRT), played at different places each year
merged_df = merged_df[~merged_df['Opponent'].isin(['APR', 'NPR', 'RIC', 'CRT'])]

#make new col for game elevation
merged_df = merged_df.merge(team_elevation_df, on='Game Location', how='inner')

#convert FG percentage to numeric
merged_df['FG Percentage'] = pd.to_numeric(merged_df['FG Percentage'], errors='coerce')
merged_df.dropna(inplace=True)

merged_df.sort_values('Game Elevation')
Out[64]:
key_0 Birth Place Player Id Unnamed: 0 Location Elevation Year Home or Away Opponent Longest FG Made FGs Attempted FGs Made FG Percentage Extra Points Attempted Extra Points Made Percentage of Extra Points Made Team Career FG Percentage Game Location Game Elevation
1625 Baton Rouge , LA Baton Rouge , LA stephengostkowski/2506922 8 Baton Rouge , LA 28.0 2009 Away NO 36 2 1 50.0 2 2 100.0 NE 83.9 NO 1.0
2917 Yankton , SD Yankton , SD adamvinatieri/2503471 20 Yankton , SD 368.0 1998 Away NO 49 3 3 100.0 3 3 100.0 NE 79.5 NO 1.0
1425 Omaha , NE Omaha , NE dancarpenter/2507401 6 Omaha , NE 331.0 2009 Away NO 41 1 1 100.0 1 1 100.0 MIA 89.3 NO 1.0
1445 Omaha , NE Omaha , NE dancarpenter/2507401 6 Omaha , NE 331.0 2008 Away NO 0 1 0 0.0 2 2 100.0 MIA 84.0 NO 1.0
1487 Baton Rouge , LA Baton Rouge , LA stephengostkowski/2506922 8 Baton Rouge , LA 28.0 2015 Away NO 36 3 2 66.7 2 2 100.0 NE 91.7 NO 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1162 Mayfield Heights , OH Mayfield Heights , OH mattprater/2506677 5 Mayfield Heights , OH 329.0 2013 Home WAS 19 1 1 100.0 6 6 100.0 DEN 96.2 DEN 1684.0
3709 Orange , TX Orange , TX mattbryant/2504797 27 Orange , TX 8.0 2008 Away DEN 33 2 2 100.0 1 1 100.0 TB 84.2 DEN 1684.0
1158 Mayfield Heights , OH Mayfield Heights , OH mattprater/2506677 5 Mayfield Heights , OH 329.0 2013 Home PHI 53 1 1 100.0 7 7 100.0 DEN 96.2 DEN 1684.0
1177 Mayfield Heights , OH Mayfield Heights , OH mattprater/2506677 5 Mayfield Heights , OH 329.0 2012 Home SF 53 1 1 100.0 3 3 100.0 DEN 81.3 DEN 1684.0
1232 Mayfield Heights , OH Mayfield Heights , OH mattprater/2506677 5 Mayfield Heights , OH 329.0 2010 Home STL 49 2 2 100.0 3 3 100.0 DEN 88.9 DEN 1684.0

3221 rows × 20 columns

Read more about heat maps: https://www.atlassian.com/data/charts/heatmap-complete-guide

In [65]:
#create heat map that shows how kickers born at different elevations perform at different elevations



#prepare the data for the heatmap
heatmap_data = merged_df.pivot_table(values='FG Percentage',
                                      index='Game Elevation',
                                      columns='Elevation',
                                      aggfunc='mean')

#create a diverging color map (red for low accuracy, green for high)
cmap = sns.diverging_palette(0, 120, as_cmap=True)

#create heatmap
plt.figure(figsize=(14, 10))  #increase figure size
sns.heatmap(heatmap_data,
            cmap=cmap,  #use custom color map
            annot=True,  #enable annotations
            fmt='.1f',  #format for annotations
            annot_kws={"size": 8},  #adjust annotation size
            linewidths=0.5,  #width of lines that will divide each cell
            cbar_kws={'label': 'Average Kicker Accuracy (%)',
                       'ticks': np.arange(0, 101, 10)})  #color bar label and ticks

#label axes and title
plt.xlabel('Kicker Birthplace Elevation (m)', fontsize=16)
plt.ylabel('Game Elevation (m)', fontsize=16)
plt.title('Heatmap of Kicker Accuracy by Elevations', fontsize=18, fontweight='bold')

#customize ticks
plt.xticks(fontsize=12, rotation=45)  # Rotate x-ticks for better visibility
plt.yticks(fontsize=12)

#set major ticks to reduce clutter
plt.locator_params(axis='x', nbins=10)  # Reduce the number of x-ticks
plt.locator_params(axis='y', nbins=10)  # Reduce the number of y-ticks

#show plot
plt.tight_layout()  #adjust layout for better fit
plt.show()
[Figure: Heatmap of Kicker Accuracy by Elevations]
In [66]:
#scatter plot showing birth elevation vs kicker accuracy at or above the median game elevation

#filter the DataFrame for the specified game elevation
median_elevation = merged_df['Game Elevation'].median()
filtered_df = merged_df[merged_df['Game Elevation'] >= median_elevation]

#data for the scatter plot
birthplace_elevation = filtered_df['Elevation']
kicker_accuracy = filtered_df['FG Percentage']

#create scatter plot
plt.scatter(birthplace_elevation, kicker_accuracy,
            color='blue',
            alpha=0.7,  #set transparency
            edgecolors='w')  #add white edges to points

#label axes and title
plt.xlabel('Kicker Birthplace Elevation (m)', fontsize=14)
plt.ylabel('Kicker Accuracy (%)', fontsize=14)
plt.title('Kicker Accuracy vs Birth Elevation Above Median Elevation', fontsize=16, fontweight='bold')

#customize ticks
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

#add grid for better visibility
plt.grid(color='grey', linestyle='--', linewidth=0.5)

#show plot
plt.show()
[Figure: Kicker Accuracy vs Birth Elevation Above Median Elevation]

Hypothesis Testing:

Group 1: Kickers born below the median elevation of games played.

Group 2: Kickers born above the median elevation of games played.

H0: There is no significant difference in kicker accuracy between the two groups at elevations above the median elevation of games played.

Ha: There is a significant difference in kicker accuracy between the two groups at elevations above the median elevation of games played.

In [67]:
#do hypothesis testing

from scipy import stats

#calculate the median game elevation
median_game_elevation = merged_df['Game Elevation'].median()

#split the data into two groups based on if they were born above or below median game elevation
group_below_median = merged_df[merged_df['Elevation'] < median_game_elevation]
group_above_median = merged_df[merged_df['Elevation'] >= median_game_elevation]

#get accuracy for both groups
accuracy_below = group_below_median['FG Percentage']
accuracy_above = group_above_median['FG Percentage']

#perform t-test, want to compare means of two groups and have sample size > 30
t_stat, p_value = stats.ttest_ind(accuracy_below, accuracy_above)
test_type = "t-test"

#print results
print(f"P-Value: {p_value}")
P-Value: 0.9912679241998194

Using a significance level of 0.05, we fail to reject the null hypothesis that there is no difference in accuracy between the two groups at elevations above the median game elevation. This is because the p-value of 0.99 is not less than 0.05.

Read more about hypothesis testing: https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/animal-science/hypothesis-tests/introduction-to-hypothesis-testing-and-confidence-intervals.html
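The two-sample t-test above assumes roughly equal variances in the two groups. As a robustness check, a Welch's t-test (`equal_var=False` in `scipy.stats.ttest_ind`) relaxes that assumption. This is only a sketch: the samples below are synthetic stand-ins for the real `FG Percentage` columns, not the project data.

```python
#Welch's t-test robustness check (does not assume equal group variances)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
accuracy_below = rng.normal(83, 8, size=200)   #stand-in for group below median
accuracy_above = rng.normal(83, 12, size=150)  #stand-in for group above median

#equal_var=False selects Welch's version of the independent t-test
t_stat, p_value = stats.ttest_ind(accuracy_below, accuracy_above, equal_var=False)
print(f"Welch t-test p-value: {p_value:.4f}")
```

If the Welch p-value agrees with the standard t-test, the conclusion is not sensitive to the equal-variance assumption.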

Correlation analysis:

Very low negative correlation which shows that birth elevation has little to no relationship with kicking performance.

In [68]:
#create scatter plot to show overall performance of kickers from different elevations

#data for the scatter plot
birth_elevation = merged_df['Elevation']
career_fg_percentage = merged_df['Career FG Percentage']

#calculate correlation between birth elevation and field goal percentage
correlation = merged_df['Elevation'].corr(merged_df['Career FG Percentage'])
print(f"Correlation between Birth Elevation and Career FG Percentage: {correlation}")

#create scatter plot
plt.scatter(birth_elevation, career_fg_percentage, alpha=0.6)

#label axes and title
plt.xlabel('Kicker Birth Elevation (m)')
plt.ylabel('Career FG Percentage (%)')
plt.title('Scatter Plot of Career FG Percentage vs Birth Elevation')

#show plot
plt.grid()
plt.show()
Correlation between Birth Elevation and Career FG Percentage: -0.07681920437272324
[Figure: Scatter Plot of Career FG Percentage vs Birth Elevation]

Read more about correlation: https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/11-correlation-and-regression
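Pearson correlation only measures linear association. As a follow-up, a Spearman rank correlation would also catch a monotone but nonlinear relationship between birth elevation and career accuracy. The sketch below uses synthetic stand-ins for the `Elevation` and `Career FG Percentage` columns, not the project data.

```python
#Spearman rank correlation as a nonlinearity check
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
birth_elevation = rng.uniform(0, 1000, size=300)  #stand-in for Elevation
career_fg = 83 + rng.normal(0, 8, size=300)       #stand-in for Career FG Percentage

#spearmanr ranks both variables before correlating, so it is robust to
#monotone nonlinear relationships and outliers
rho, p = stats.spearmanr(birth_elevation, career_fg)
print(f"Spearman rho: {rho:.3f}, p-value: {p:.3f}")
```

A Spearman rho near zero alongside the near-zero Pearson value would strengthen the conclusion that birth elevation and career accuracy are unrelated.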

Dataset descriptive analysis:

In [69]:
#print summary statistics for dataset
print(merged_df.describe())

#generate histogram for game elevations
plt.hist(merged_df['Elevation'])
plt.ylabel('Number of Kickers')
plt.xlabel('Elevation (m)')
plt.title('Count of Kickers From Varying Elevations')
plt.show()
        Unnamed: 0    Elevation         Year  FG Percentage  \
count  3221.000000  3221.000000  3221.000000    3221.000000   
mean     12.780503   234.675877  2011.137845      83.349922   
std       9.312714   254.228598     4.504546      28.901746   
min       0.000000     7.000000  1996.000000       0.000000   
25%       4.000000    23.000000  2009.000000      66.700000   
50%      11.000000   174.000000  2012.000000     100.000000   
75%      20.000000   368.000000  2015.000000     100.000000   
max      30.000000   979.000000  2016.000000     100.000000   

       Career FG Percentage  Game Elevation  
count           3221.000000     3221.000000  
mean              83.428469      213.067991  
std                7.714799      356.085307  
min                0.000000        1.000000  
25%               79.400000       51.000000  
50%               84.200000      135.000000  
75%               88.900000      225.000000  
max              100.000000     1684.000000  
[Figure: Count of Kickers From Varying Elevations]

From this graph we can tell that while we have a good range of kickers from varying elevations, the data is heavily skewed toward lower elevations. This can be explained by there being fewer populated places at high elevation.
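The skew visible in the histogram can also be quantified with `scipy.stats.skew`, where a positive value confirms the long right tail. The sketch below uses an exponential sample as a stand-in for the real `Elevation` column.

```python
#quantify the right skew of the elevation distribution
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)
elevations = rng.exponential(scale=235, size=3221)  #right-skewed stand-in for birth elevations

#sample skewness: > 0 means a longer right tail (more mass at low elevations)
skewness = skew(elevations)
print(f"Skewness: {skewness:.2f}")
```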

Machine Learning Primary Analysis (Owen)¶

I will use a clustering model to group the kickers. The first step is to find k with the elbow method.

I decided to use a clustering model to help determine if kickers born at higher altitudes perform better there. I chose clustering because if there was a positive relationship between birth altitude and performance at high altitude I should be able to see a clear cluster of players with high performance and high birth elevations. For the specific model I chose to use a K means clustering model. To find a good K value I used the elbow method and also checked the silhouette scores at different K values. After doing both of these methods I settled on K = 3.

In [70]:
#standardize data
features = merged_df[['Elevation', 'Game Elevation', 'FG Percentage',
                      'Longest FG Made', 'Extra Points Made', 'Career FG Percentage']]
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

inertia = []

# Test different numbers of clusters
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(features_scaled)
    inertia.append(kmeans.inertia_)

# Plot the elbow curve
plt.plot(range(1, 11), inertia, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
[Figure: Elbow Method]

Next I will use silhouette scores to get more info on the best K value.

In [71]:
# List to store silhouette scores
silhouette_scores = []

# Test different values of k (number of clusters)
k_values = range(2, 11)  # Start from 2 because silhouette score requires at least 2 clusters
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    clusters = kmeans.fit_predict(features_scaled)  # features_scaled is your standardized data
    score = silhouette_score(features_scaled, clusters)
    silhouette_scores.append(score)

# Plot the silhouette scores
plt.plot(k_values, silhouette_scores, marker='o', linestyle='-')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score vs Number of Clusters')
plt.show()
[Figure: Silhouette Score vs Number of Clusters]
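To make the choice of k explicit, the best k can be read off programmatically as the argmax of the silhouette scores. The scores below are illustrative stand-ins, not the values computed above; with the real `silhouette_scores` list from the previous cell this would be a one-liner.

```python
#pick the k with the highest silhouette score
import numpy as np

k_values = range(2, 11)
#illustrative stand-in scores for k = 2..10 (replace with the real list above)
silhouette_scores = [0.31, 0.39, 0.35, 0.33, 0.30, 0.29, 0.28, 0.27, 0.26]

best_k = list(k_values)[int(np.argmax(silhouette_scores))]
print(f"Best k by silhouette score: {best_k}")
```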

Visualization (Owen)¶

For my analysis I decided to use a 3d scatter plot where each point represents a game played by a kicker. I chose a scatter plot so I could plot each game and visualize the clusters that were created by the K means clustering model. I chose to do a 3d plot because I needed to see how both birth elevation and game elevation affect the performance of a kicker. To do this I put game elevation on the x axis, birth elevation on the y axis, and kicker FG percentage on the z axis.

From these I will use k = 3. Now to do the actual clustering.

In [73]:
# Fit the model with the chosen number of clusters
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(features_scaled)

# Add the cluster labels to the original dataframe
merged_df['Cluster'] = clusters

# Define distinct colors for each cluster (adjust as needed)
cluster_colors = ['red', 'green', 'blue']

# Add a 3D plot
fig = plt.figure(figsize=(20, 16))
ax = fig.add_subplot(111, projection='3d')

# Plot the data with clusters using individual colors
for cluster in range(3):  # Assuming 3 clusters
    cluster_data = merged_df[merged_df['Cluster'] == cluster]
    ax.scatter(cluster_data['Elevation'],
               cluster_data['Game Elevation'],
               cluster_data['FG Percentage'],
               color=cluster_colors[cluster], label=f'Cluster {cluster}')

# Labels and title
ax.set_xlabel('Elevation')
ax.set_ylabel('Game Elevation')
ax.set_zlabel('FG Percentage')
ax.set_title('3D Cluster Plot')

# Show the legend to identify cluster colors
ax.legend()

plt.show()

# Calculate silhouette score using the features from the model (in your case, the scaled features)
score = silhouette_score(features_scaled, clusters)
print(f'Silhouette Score: {score}')
[Figure: 3D Cluster Plot]
Silhouette Score: 0.3922277414278905

Read more about K means clustering: https://www.ibm.com/topics/k-means-clustering

From the visualization we can see the three clusters. One sits at low game elevation, low FG percentage, and varying birth elevation. Another cluster can be seen at low game elevation, mid to high FG percentage, and varying birth elevation. Finally, there is a third cluster with high game elevation and varying birth elevation and FG percentage. In this third cluster we see slightly fewer data points at high birth elevation and low FG percentage, but nothing significant. In addition, there are fewer data points with high birth elevations overall, which could make the results unreliable. Due to the variability in the third group's birth elevation, we can say that being born at a higher elevation does not give an advantage when kicking field goals at a higher elevation.
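One way to make these cluster descriptions concrete is to profile each cluster with its per-feature means via `groupby`. The sketch below uses a small toy frame standing in for `merged_df` after the `Cluster` column is added; the numbers are illustrative, not the project data.

```python
#profile clusters by per-feature means
import pandas as pd

#toy stand-in for merged_df with cluster labels attached
df = pd.DataFrame({
    'Cluster': [0, 0, 1, 1, 2, 2],
    'FG Percentage': [60, 65, 95, 100, 80, 85],
    'Game Elevation': [50, 60, 40, 55, 1500, 1600],
    'Elevation': [100, 900, 150, 800, 120, 950],
})

#mean of each feature within each cluster summarizes what the cluster represents
profile = df.groupby('Cluster')[['FG Percentage', 'Game Elevation', 'Elevation']].mean()
print(profile)
```

A table like this makes it easy to check, for example, whether the high-game-elevation cluster actually has a higher mean birth elevation.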

Insights and Conclusion (Owen)¶

From the data and machine learning analysis we can conclude that being born at a higher elevation does not give an advantage when kicking at higher altitudes. For NFL teams this means that instead of looking for a good kicker who was born at a higher altitude, they should simply focus on getting the best overall kicker. Getting the best overall kicker will prepare a team better than trying to get a kicker specialized for high altitudes.

Overall Conclusion¶

Overall, throughout this project we gained valuable skills in how to find, process, and gain insights from data. We found that experience is not as strong an indicator of QB efficiency as we would have thought, that there was no correlation between height and yards per reception, and that being born at a high altitude does not make you a better kicker at high altitudes. Overall this was a very educational experience that pushed us to grow and learn new things.